Managing power performance of distributed computing systems

ABSTRACT

A method of managing power and performance of a High-performance computing (HPC) system is shown, the method including: determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job; determining a power and cooling capacity of the HPC system; allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system; and executing the job on selected HPC nodes.

The present application claims the benefit of prior U.S. Provisional Patent Application No. 62/040,576, entitled “SIMPLE POWER-AWARE SCHEDULER TO LIMIT POWER CONSUMPTION BY HPC SYSTEM WITHIN A BUDGET” filed on Aug. 22, 2014, which is hereby incorporated by reference in its entirety.

The present application is related to the U.S. patent application Ser. No. ______ (Attorney Docket No. 42P73498) entitled ______ filed ______; the U.S. patent application Ser. No. ______ (Attorney Docket No. 42P74562) entitled ______ filed ______; the U.S. patent application Ser. No. ______ (Attorney Docket No. 42P74563) entitled ______ filed ______; the U.S. patent application Ser. No. ______ (Attorney Docket No. 42P74564) entitled ______ filed ______; the U.S. patent application Ser. No. ______ (Attorney Docket No. 42P74565) entitled ______ filed ______; the U.S. patent application Ser. No. ______ (Attorney Docket No. 42P74566) entitled ______ filed ______; the U.S. patent application Ser. No. ______ (Attorney Docket No. 42P74568) entitled ______ filed ______; and the U.S. patent application Ser. No. ______ (Attorney Docket No. 42P74569) entitled “A POWER AWARE JOB SCHEDULER AND MANAGER FOR A DATA PROCESSING SYSTEM”, filed ______.

FIELD

Embodiments of the invention relate to the field of computer systems; and more specifically, to the methods and systems of power management and monitoring of high performance computing systems.

BACKGROUND

A High Performance Computing (HPC) system performs parallel computing by simultaneous use of multiple nodes to execute a computational assignment referred to as a job. Each node typically includes processors, memory, an operating system, and I/O components. The nodes communicate with each other through a high speed network fabric and may use shared file systems or storage. The job is divided into thousands of parallel tasks distributed over thousands of nodes. These tasks synchronize with each other hundreds of times a second. A HPC system can typically consume megawatts of power.

Growing usage of HPC systems in recent years has made power management a concern in the industry. Future systems are expected to deliver higher performance while operating in a power constrained environment. However, current methods used to manage power and cooling in traditional servers cause a degradation of performance.

The most commonly used power management systems use an out of band mechanism to enforce both power allocation and system capacity limits. Commonly used approaches to limit power usage of an HPC system, such as Running Average Power Limit (RAPL), Node Manager (NM), and Datacenter Manager (DCM), use a power capping methodology. These power management systems define and enforce a power cap for each layer of HPC systems (e.g., Datacenter, Processors, Racks, Nodes, etc.) based on the limits. However, the power allocation in this methodology is not tailored to increase the performance. For example, Node Managers allocate equal power to the nodes within their power budget. However, if nodes under the same power conditions operate at different performance levels, such a variation in performance of the nodes results in degradation of the overall performance of the HPC system.

Furthermore, today's HPC facilities communicate their demand for power to utility companies months in advance. Lacking a proper monitoring mechanism to forecast power consumption, such demands are usually made equal to or greater than the maximum power for a worst case workload a facility can use. However, the actual power consumption is usually expected to be lower and so the unused power is wasted.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:

FIG. 1 illustrates an exemplary block diagram of an overall architecture of a power management and monitoring system in accordance with one embodiment.

FIG. 2 illustrates an exemplary block diagram of overall interaction architecture of HPC Power-Performance Manager in accordance with one embodiment.

FIG. 3 illustrates an exemplary block diagram showing an interaction between the HPC facility power manager and other components of the HPC facility.

FIG. 4 illustrates an exemplary block diagram showing an interaction between the HPC System Power Manager with a Rack Manager and a Node Manager.

FIG. 5 illustrates an HPPM response mechanism at a node level in case of a power delivery or cooling failure.

FIG. 6 illustrates an exemplary block diagram of a HPC system receiving various policy instructions.

FIG. 7 illustrates an exemplary block diagram showing the interaction between the HPC Resource Manager and other components of the HPC System.

FIG. 8 illustrates an exemplary block diagram of the interaction of the Job Manager with Power Aware Job Launcher according to power performance policies.

FIG. 9 illustrates one embodiment of a process for power management and monitoring of high performance computing systems.

FIG. 10 illustrates another embodiment of a process for power management and monitoring of high performance computing systems.

DESCRIPTION OF EMBODIMENTS

The following description describes methods and apparatuses for power management and monitoring of high performance computing systems. In the following description, numerous specific details such as specific power policies, particular power management devices, etc. are set forth in order to provide a more thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details.

References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

As discussed above, embodiments described herein relate to power management and monitoring for high performance computing systems. According to various embodiments of the invention, a framework for workload aware, hierarchical and holistic management and monitoring of power and performance is disclosed.

FIG. 1 illustrates an example of power management and monitoring system for HPC systems according to one embodiment. The system is referred to herein as an HPC Power-Performance Manager (HPPM). In this example, HPC System 400 includes multiple components including Resource Manager 410, Job Manager 420, Datacenter Manager 310, Rack Manager 430, Node Manager 431, and Thermal Control 432. In one embodiment, HPPM receives numerous power performance policies input at different stages of the management. In one embodiment, power performance policies include a facility policy, a utility provider policy, a facility administrative policy, and a user policy.

HPC System Power Manager 300 communicates the capacity and requirements of HPC System 400 to HPC Facility Power Manager 200. HPC Facility Power Manager 200 then communicates the power allocated by the utility provider back to HPC System Power Manager 300. In one embodiment, HPC System Power Manager 300 also receives administrative policies from HPC System Administrator 202.

In order to properly allocate power to HPC System 400, in one embodiment HPC System Power Manager 300 receives the power and thermal capacity of HPC System 400 and maintains the average power consumption of HPC System 400 at or below the allocation. In one embodiment, a soft limit is defined in part by the power available for the allocation. In one embodiment, the soft limit includes the power allocated to each HPC system within HPPM and the power allocated to each job. In one embodiment, Job Manager 420 enforces the soft limit on each job based on the power consumption of each node.

Furthermore, the power consumption of HPC System 400 never exceeds the power and thermal capacity of the cooling and power delivery infrastructures. In one embodiment, a hard limit is defined by the power and thermal capacity of the cooling and power delivery infrastructures. In one embodiment, the hard limit defines the power and cooling capability available for the nodes, racks, systems and datacenters within a HPC facility. The cooling and power infrastructures may or may not be shared by different elements of the HPC facility. In one embodiment, the hard limit fluctuates in response to failures in cooling and power delivery infrastructures, while the soft limit remains at or below the hard limit at any time.
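By way of a non-limiting illustration, the relationship between the soft limit and the hard limit described above may be sketched as follows. The function name `clamp_soft_limit` and the wattage values are assumptions chosen for this sketch and do not appear in the embodiments.

```python
def clamp_soft_limit(requested_soft_w: float, hard_limit_w: float) -> float:
    """Return a soft limit that never exceeds the current hard limit.

    Both arguments are in watts. The name and signature are hypothetical.
    """
    if requested_soft_w < 0 or hard_limit_w < 0:
        raise ValueError("power limits must be non-negative")
    return min(requested_soft_w, hard_limit_w)


# Normal operation: the requested soft limit fits under the hard limit.
soft = clamp_soft_limit(requested_soft_w=900.0, hard_limit_w=1000.0)

# A cooling or power delivery failure lowers the hard limit (assumed value),
# and the soft limit follows it down.
soft_after_failure = clamp_soft_limit(requested_soft_w=900.0, hard_limit_w=700.0)
```

In this sketch the soft limit is recomputed whenever the hard limit changes, so the soft limit remains at or below the hard limit at any time.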

HPC System Power Manager 300 uses Out of Band mechanism 301 (e.g., Node Manager 431, Thermal Control 432, Rack Manager 430 and Datacenter Manager 310) to monitor and manage the hard limit for each component. In one embodiment, the Out of Band mechanism 301, unlike In Band mechanism 302, uses an independent embedded controller outside the system with an independent networking capability to perform its operation.

To maintain the power consumption of HPC System 400 within the limits (both the hard limit and the soft limit) and to increase energy efficiency, HPC System Power Manager 300 allocates power to the jobs. In one embodiment, the allocation of power to the jobs is based on the dynamic monitoring and power-aware management of Resource Manager 410 and Job Manager 420 further described below. In one embodiment, Resource Manager 410 and Job Manager 420 are operated by In Band mechanism 302. In one embodiment, In Band mechanism 302 uses the system network and software for monitoring, communication, and execution.

An advantage of embodiments described herein is that the power consumption is managed by allocating power to the jobs. As such, the power consumption is allocated in a way to cause significant reduction in the performance variations of the nodes and subsequently improvement in job completion time. In other words, the power allocated to a particular job is distributed among the nodes dedicated to run the job in such a way to achieve the increased performance.

FIG. 2 illustrates an example of interactions between different components of HPC Power-Performance Manager 100. It is pointed out that those elements of FIG. 2 having the same reference numbers (or names) as the elements of any other figure can operate or function in any manner similar to that described, but are not limited to such. The lines connecting the blocks represent communication between different components of a HPPM.

In one embodiment, these communications include communicating, for example, the soft and hard limits for each component of the HPPM 100, reporting the power and thermal status of the components, reporting failures of power and thermal infrastructures, and communicating the available power for the components, etc. In one embodiment, HPPM 100 includes multiple components divided between multiple datacenters within a HPC facility. HPPM 100 also includes power and cooling resources shared by the components. In one embodiment, each datacenter includes a plurality of server racks, and each server rack includes a plurality of nodes.

In one embodiment, HPPM 100 manages power and performance of the system by forming a dynamic hierarchical management and monitoring structure. The power and thermal status of each layer is regularly monitored by a managing component and reported to a higher layer. The managing component of the higher layer aggregates the power and thermal conditions of its lower components and reports them to its higher layer. Conversely, the higher managing component ensures the allocation of power to its lower layers is based upon the current power and thermal capacity of their components.

For example, in one embodiment, HPC Facility Power Manager 200 distributes power to multiple datacenters and resources shared within the HPC facility. HPC Facility Power Manager 200 receives the aggregated report of the power and thermal conditions of the HPC facility from Datacenter Manager 210. In one embodiment, Datacenter Manager 210 is the highest managing component of HPPM 100. Datacenter Manager 210 is the higher managing component of a plurality of datacenters. Each datacenter is managed by a datacenter manager, such as, for example, Datacenter Manager 310. Datacenter Manager 310 is the higher managing component of a plurality of server racks. Each server rack includes a plurality of nodes. In one embodiment, Datacenter Manager 310 is a managing component for the nodes of all or part of a server rack, while in other embodiments Datacenter Manager 310 is a managing component for nodes of multiple racks. Each node is managed by a node manager. For example, each of Nodes 500 is managed by Node Manager 431. Node Manager 431 monitors and manages power consumption and thermal status of its associated node.

Datacenter Manager 310 is also a higher managing component for the power and cooling resources shared by a plurality of the nodes. Each shared power and cooling resource is managed by a rack manager, for example Rack Manager 430. In one embodiment, a plurality of nodes share multiple power and cooling resources, each managed by a rack manager. In one embodiment, HPC Facility Power Manager 200 sends the capacity and requirements of the HPC facility to a utility provider. HPC Facility Power Manager 200 distributes the power budget to the HPC System Power Manager associated with each HPC System (e.g., HPC System Power Manager 300). HPC System Power Manager 300 determines how much power to allocate to each job. Job Manager 420 manages power performance of a job within the budget allocated by HPC System Power Manager 300. Job Manager 420 manages a job throughout its life cycle by controlling the power allocation and frequencies of Nodes 500.

In one embodiment, if a power or thermal failure occurs on any lower layers of Datacenter Manager 310, Datacenter Manager 310 immediately warns HPC System Power Manager 300 of the change in power or thermal capacity. Subsequently, HPC System Power Manager 300 adjusts the power consumption of the HPC system by changing the power allocation to the jobs.
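By way of a non-limiting illustration, the hierarchical reporting described in connection with FIG. 2, in which each managing component aggregates the power of its lower layers and reports the result upward, may be sketched as follows. The class name `Layer`, its fields, and the wattage values are assumptions made for this sketch only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Layer:
    """One managed layer (node, rack, or datacenter); names are illustrative."""
    name: str
    own_power_w: float = 0.0  # power measured at this layer itself, in watts
    children: List["Layer"] = field(default_factory=list)

    def aggregate_power_w(self) -> float:
        """Sum this layer's own power with the aggregated power of lower layers."""
        return self.own_power_w + sum(c.aggregate_power_w() for c in self.children)


# Four nodes at 250 W each under one rack with 50 W of shared overhead (assumed).
nodes = [Layer(f"node{i}", own_power_w=250.0) for i in range(4)]
rack = Layer("rack0", own_power_w=50.0, children=nodes)
datacenter = Layer("dc0", children=[rack])

# The value a datacenter-level manager would report to its higher layer.
total = datacenter.aggregate_power_w()
```

The same recursive structure could carry thermal status or hard limits downward; only the upward power report is shown here.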

FIG. 3 demonstrates the role of HPC Facility Power Manager 200 in more detail. It is pointed out that those elements of FIG. 3 having the same reference numbers (or names) as the elements of any other figure can operate or function in any manner similar to that described, but are not limited to such. In one embodiment, HPC Facility 101 includes HPC Facility Power Manager 200, Power Generator and Storage 210, Power Convertor 220, Cooling System 230 that may include storage of a cooling medium, and several HPC systems including the HPC System 400. Each HPC system is managed by a HPC System Power Manager (e.g., HPC System Power Manager 300 manages HPC System 400).

In one embodiment, HPC Facility Power Manager 200 manages the power consumption of HPC Facility 101. HPC Facility Power Manager 200 receives facility level policies from the Facility Administrator 102. In one embodiment, the facility level policies relate to selecting a local source of power, environmental considerations, and the overall operation policy of the facility. HPC Facility Power Manager 200 also communicates with Utility Provider 103. In one embodiment, HPC Facility Power Manager 200 communicates the forecasted capacity and requirements of HPC Facility 101 in advance to the Utility Provider 103. In one embodiment, HPC Facility 101 uses a Demand/Response interface to communicate with Utility Provider 103.

In one embodiment, the Demand/Response interface provides a non-proprietary interface that allows the Utility Provider 103 to send signals about electricity price and system grid reliability directly to customers, e.g. HPC Facility 101. The dynamic monitoring allows for HPC Facility Power Manager 200 to more accurately estimate the required power and communicate its capacity and requirement automatically to Utility Provider 103. This method allows for improving cost based on the price in real time and reduces the disparity between the allocated power by the Utility Provider 103 and the power actually used by the Facility 101.

In one embodiment, HPPM determines a power budget at a given time based upon the available power from Utility Provider 103, the cost of the power from Utility Provider 103, the available power in the local Power Generator and Storage 210, and actual demand by the HPC systems. In one embodiment, HPPM substitutes the energy from the utility provider with the energy from the local storages or electricity generators. In one embodiment, HPPM receives the current price of electricity and makes the electricity produced by Power Generator and Storage 210 available for sale in the market.
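By way of a non-limiting illustration, one possible way to combine the inputs listed above into a power budget is to cover the demand from the cheaper source first. The cheapest-first rule, the notion of a price for locally generated power, and the names below are assumptions made for this sketch; the embodiments do not prescribe a particular formula.

```python
from typing import Dict, Tuple


def power_budget_w(
    utility_available_w: float,
    utility_price: float,
    local_available_w: float,
    local_price: float,
    demand_w: float,
) -> Tuple[float, Dict[str, float]]:
    """Cover demand from the cheaper source first (an assumed rule).

    Returns the resulting budget in watts and the amount drawn per source.
    """
    sources = sorted(
        [
            (utility_price, "utility", utility_available_w),
            (local_price, "local", local_available_w),
        ]
    )
    remaining = demand_w
    drawn = {"utility": 0.0, "local": 0.0}
    for _price, name, available in sources:
        take = min(remaining, available)
        drawn[name] = take
        remaining -= take
    # The budget is whatever part of the demand the two sources can cover.
    return demand_w - remaining, drawn


# Local generation is cheaper here (assumed prices), so it is used first.
budget, drawn = power_budget_w(
    utility_available_w=800.0,
    utility_price=0.12,
    local_available_w=300.0,
    local_price=0.08,
    demand_w=1000.0,
)
```

If the combined supply is smaller than the demand, the returned budget is smaller than the demand, and the shortfall would have to be absorbed by reducing the power allocated to jobs.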

FIG. 4 illustrates how HPC System Power Manager 300 manages shared power supply among nodes using a combination of Rack Manager 430 and Node Manager 440. It is pointed out that those elements of FIG. 4 having the same reference numbers (or names) as the elements of any other figure can operate or function in any manner similar to that described, but are not limited to such.

In one embodiment, Rack Manager 430 reports the status of the shared resources and receives power limits from Datacenter Manager 310. Node Manager 440 reports node power consumption and receives node power limits from Datacenter Manager 310. Similarly, Datacenter Manager 310 reports system power consumption to HPC System Power Manager 300. The communication between HPC System Power Manager 300 and Datacenter Manager 310 facilitates monitoring of the cooling and power delivery infrastructure in order to maintain the power consumption within the hard limit. In one embodiment, HPC System Power Manager 300 maintains the power consumption of the nodes or processors by adjusting the power allocated to them.

In one embodiment, if a failure of the power supply or cooling systems results in a sudden reduction of available power, the hard limit is reduced automatically by either or both of Rack Manager 430 and Node Manager 440 to a lower limit to avoid a complete failure of the power supply. Subsequently, the sudden reduction of available power is reported to HPC System Power Manager 300 through Datacenter Manager 310 by either or both of Rack Manager 430 and Node Manager 440, so that HPC System Power Manager 300 can readjust the power allocation accordingly.

FIG. 5 illustrates an HPPM response mechanism at a node level in case of a power delivery or cooling failure. It is pointed out that those elements of FIG. 5 having the same reference numbers (or names) as the elements of any other figure can operate or function in any manner similar to that described, but are not limited to such.

In one embodiment, a cooling and power delivery failure does not impact all nodes equally. Once Node Manager 431 identifies the impacted nodes, for example Nodes 500, it will adjust the associated hard limit for Nodes 500. This hard limit is then communicated to Job Manager 420. Job Manager 420 adjusts the soft limit associated with Nodes 500 to maintain both the soft limit and the power consumption of Nodes 500 at or below the hard limit. In one embodiment, the communication between Node Manager 431 and Job Manager 420 occurs at intervals on the order of milliseconds.

In one embodiment, a faster response is required to avoid further power failure of the system. As such, Node Manager 431 directly alerts Nodes 500. The alert imposes a restriction on Nodes 500 and causes an immediate reduction of power consumption by Nodes 500. In one embodiment, such a reduction could be more than necessary to avoid further power failures. Subsequently, Node Manager 431 communicates the new hard limit to Job Manager 420. Job Manager 420 adjusts the soft limits of Nodes 500 to maintain the power consumption of Nodes 500 at or below the hard limit. Job Manager 420 enforces the new hard limit and removes the alert asserted by Node Manager 431.
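By way of a non-limiting illustration, the two-stage response described above, in which an immediate and possibly overly conservative restriction is followed by a readjusted soft limit and removal of the alert, may be sketched as follows. The class `NodeState`, the two handler functions, and the emergency cap value are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Per-node power state; field names are illustrative only."""
    hard_limit_w: float
    soft_limit_w: float
    alert: bool = False
    power_cap_w: float = 0.0  # cap currently in effect on the node

    def __post_init__(self) -> None:
        # Under normal operation the node runs under its soft limit.
        self.power_cap_w = self.soft_limit_w


def node_manager_on_failure(
    node: NodeState, new_hard_limit_w: float, emergency_cap_w: float
) -> None:
    """Stage 1: assert the alert and throttle immediately (may over-throttle)."""
    node.hard_limit_w = new_hard_limit_w
    node.alert = True
    node.power_cap_w = min(emergency_cap_w, new_hard_limit_w)


def job_manager_on_new_hard_limit(node: NodeState) -> None:
    """Stage 2: bring the soft limit under the new hard limit, clear the alert."""
    node.soft_limit_w = min(node.soft_limit_w, node.hard_limit_w)
    node.power_cap_w = node.soft_limit_w
    node.alert = False


node = NodeState(hard_limit_w=300.0, soft_limit_w=280.0)
node_manager_on_failure(node, new_hard_limit_w=200.0, emergency_cap_w=120.0)
cap_during_alert = node.power_cap_w  # conservative cap while the alert is asserted
job_manager_on_new_hard_limit(node)
```

In this sketch the node briefly runs under a cap lower than strictly necessary, and recovers to the new hard limit once the second stage completes.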

Referring to FIG. 6, an exemplary block diagram of a HPC system receiving various inputs is illustrated. It is pointed out that those elements of FIG. 6 having the same reference numbers (or names) as the elements of any other figure can operate or function in any manner similar to that described, but are not limited to such. In one embodiment described herein, HPC system 400 includes one or more operating system (OS) nodes 501, one or more compute nodes 502, one or more input/output (I/O) nodes 503 and a storage system 504. The high-speed fabric 505 communicatively connects the OS nodes 501, compute nodes 502, I/O nodes 503 and storage system 504. The high-speed fabric may be a network topology of nodes interconnected via one or more switches. In one embodiment, as illustrated in FIG. 6, I/O nodes 503 are communicatively connected to storage 504. In one embodiment, storage 504 is non-persistent storage such as volatile memory (e.g., any type of random access memory “RAM”), or persistent storage such as non-volatile memory (e.g., read-only memory “ROM”, power-backed RAM, flash memory, phase-change memory, etc.), a solid-state drive, a hard disk drive, an optical disc drive, or a portable memory device.

The OS nodes 501 provide a gateway to accessing the compute nodes 502. For example, prior to submitting a job for processing on the compute nodes 502, a user may be required to log-in to HPC system 400 which may be through OS nodes 501. In embodiments described herein, OS nodes 501 accept jobs submitted by users and assist in the launching and managing of jobs being processed by compute nodes 502.

In one embodiment, compute nodes 502 provide the bulk of the processing and computational power. I/O nodes 503 provide an interface between compute nodes 502 and external devices (e.g., separate computers) that provide input to HPC system 400 or receive output from HPC system 400.

The limited power allocated to HPC system 400 is used by HPC system 400 to run one or more of jobs 520. Jobs 520 comprise one or more jobs requested to be run on HPC system 400 by one or more users, for example User 201. Each job includes a power policy, which will be discussed in-depth below. The power policy will assist the HPC System Power Manager in allocating power for the job and aid in the management of the one or more jobs 520 being run by HPC system 400.

In addition, HPC System Administrator 202 provides administrative policies to guide the management of running jobs 520 by providing an over-arching policy that defines the operation of HPC system 400. In one embodiment, examples of policies in the administrative policies include, but are not limited or restricted to, (1) a policy to increase utilization of all hardware and software resources (e.g., instead of running fewer jobs at high power and leaving resources unused, run as many jobs as possible to use as much of the resources as possible); (2) a job with no power limit is given the highest priority among all running jobs; and/or (3) suspended jobs are at higher priority for resumption. Such administrative policies govern the way the HPC System Power Manager schedules, launches, suspends and re-launches one or more jobs.

The policy of User 201 can be specific to a particular job. User 201 can instruct HPC System 400 to run a particular job with no power limit or according to a customized policy. Additionally, User 201 can set the energy policy of a particular job, for example for maximum efficiency or for highest performance.

As shown in FIG. 1, HPC System Administrator 202 and User 201 communicate their policies to the HPC System Power Manager 300 and Resource Manager 410. In one embodiment, Resource Manager 410 receives these policies and formulates them into “modes” under which Job Manager 420 instructs OS Nodes 501, Compute Nodes 502, and I/O Nodes 503 to operate.

FIG. 7 shows the flow of information between Resource Manager 410 (including Power Aware Job scheduler 411 and Power Aware Job launcher 412) and other elements of the HPPM (HPC System Power Manager 300, Estimator 413, Calibrator 414, and Job Manager 420). In one embodiment, the purpose of these communications is to allocate sufficient hardware resources (e.g., nodes, processors, memories, network bandwidth, etc.) and schedule execution of appropriate jobs. In one embodiment, power is allocated to the jobs in such a way as to maintain HPC System 400 power within the limits, increase energy efficiency, and control the rate of change of power consumption of HPC System 400.

Referring to FIG. 6, to determine the amount of power to allocate to each job, HPC System Power Manager 300 communicates with Resource Manager 410. Power Aware Job scheduler 411 considers the policies and priorities of Facility Administrator 102, Utility Provider 103, User 201, and HPC System Administrator 202 and determines accordingly what hardware resources of HPC System 400 are needed to run a particular job. Additionally, Power Aware Job scheduler 411 receives power-performance characteristics of the job at different operating points from Estimator 413 and Calibrator 414. Resource Manager 410 forecasts how much power a particular job needs and takes corrective actions when actual power differs from the estimation.

Estimator 413 provides Resource Manager 410 with estimates of power consumption for each job, enabling Resource Manager 410 to efficiently schedule and monitor each job requested by one or more job owners (e.g., users). Estimator 413 provides a power consumption estimate based on, for example, maximum and average power values stored in a calibration database, wherein the calibration database is populated by the processing of Calibrator 414. In addition, the minimum power required for each job is considered. Other factors that are used by Estimator 413 to create a power consumption estimate include, but are not limited or restricted to, whether the owner of the job permits the job to be subject to a power limit, the job power policy limiting the power supplied to the job (e.g., a predetermined fixed frequency at which the job will run, a minimum power required for the job, or varying frequencies and/or power supplied determined by Resource Manager 410), the startup power for the job, the frequency at which the job will run, the available power to HPC System 400 and/or the allocated power to HPC System 400.
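By way of a non-limiting illustration, an estimate derived from calibrated maximum and average power values may be sketched as follows. The layout of the calibration table, the rule of using the average for power-limited jobs and the maximum for jobs that are not power limited, and all names and values are assumptions made for this sketch.

```python
from typing import Dict, Iterable

# Hypothetical calibration table: node id -> frequency (MHz) -> power values (W).
Calibration = Dict[str, Dict[int, Dict[str, float]]]


def estimate_job_power_w(
    calibration: Calibration,
    node_ids: Iterable[str],
    freq_mhz: int,
    power_limited: bool,
) -> float:
    """Sum per-node calibrated power for a job at one frequency.

    Assumed rule: a job that may be power limited is budgeted at its average
    power; a job that may not be limited is budgeted at its maximum power.
    """
    key = "avg_w" if power_limited else "max_w"
    return sum(calibration[n][freq_mhz][key] for n in node_ids)


calibration = {
    "n0": {2000: {"avg_w": 200.0, "max_w": 220.0}},
    "n1": {2000: {"avg_w": 210.0, "max_w": 230.0}},
}
limited_estimate = estimate_job_power_w(calibration, ["n0", "n1"], 2000, True)
unlimited_estimate = estimate_job_power_w(calibration, ["n0", "n1"], 2000, False)
```

Factors such as startup power or a minimum power required for the job are omitted from this sketch for brevity.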

Calibrator 414 calibrates the power, thermal dissipation and performance of each node within HPC System 400. Calibrator 414 provides a plurality of methods for calibrating the nodes within HPC system 400. In one embodiment, Calibrator 414 provides a first method of calibration in which every node within HPC system 400 runs sample workloads (e.g., a mini-application and/or a test script) so Calibrator 414 may sample various parameters (e.g., power consumed) at predetermined time intervals in order to determine, inter alia, (1) the average power, (2) the maximum power, and (3) the minimum power for each node. In addition, the sample workload is run on each node at every operating frequency of the node. In another embodiment, Calibrator 414 provides a second method of calibration in which calibration of one or more nodes occurs during the run-time of a job. In such a situation, Calibrator 414 samples the one or more nodes on which a job is running (e.g., processing). In the second method, Calibrator 414 obtains power measurements of each node during actual run-time.
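By way of a non-limiting illustration, the reduction of sampled power readings to average, maximum, and minimum values for each operating frequency may be sketched as follows. The function name `calibrate`, the input layout, and the sample values are assumptions made for this sketch.

```python
from statistics import mean
from typing import Dict, List


def calibrate(samples_by_freq: Dict[int, List[float]]) -> Dict[int, Dict[str, float]]:
    """Summarize power samples (watts) taken at each frequency (MHz) of one node."""
    table: Dict[int, Dict[str, float]] = {}
    for freq_mhz, samples in samples_by_freq.items():
        if not samples:
            raise ValueError(f"no samples recorded at {freq_mhz} MHz")
        table[freq_mhz] = {
            "avg_w": mean(samples),
            "max_w": max(samples),
            "min_w": min(samples),
        }
    return table


# Samples taken at predetermined intervals while a sample workload runs (assumed values).
node_table = calibrate({
    1600: [140.0, 150.0, 160.0],
    2000: [180.0, 200.0, 220.0],
})
```

The same reduction could be applied to samples gathered during the run-time of an actual job, as in the second method of calibration.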

In one embodiment, Power Aware Job Scheduler 411 is configured to receive a selection of a mode for a job, to determine an available power for the job based on the mode and to allocate a power for the job based on the available power. In one embodiment, Power Aware Job Scheduler 411 is configured to determine a uniform frequency for the job based on the available power. In one embodiment, the power aware job scheduler is configured to determine the available power for the job based on at least one of a monitored power, an estimated power, and a calibrated power.

Generally, a user submits a program to be executed (“job”) to a queue. The job queue refers to a data structure containing jobs to run. In one embodiment, Power Aware Job Scheduler 411 examines the job queue at appropriate times (periodically or at certain events, e.g., termination of previously running jobs) and determines if resources including the power needed to run the job can be allocated. In some cases, such resources can be allocated only at a future time, and in such cases the job is scheduled to run at a designated time in the future. Power Aware Job Launcher 412 selects a job among the jobs in the queue, based on available resources and priority, and schedules it to be launched. In one embodiment, in case the available power is limited, Power Aware Job Launcher 412 will look at the operating points to select the one which results in the highest frequency while maintaining the power consumption below the limit.
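By way of a non-limiting illustration, selecting the operating point that yields the highest uniform frequency while keeping the job within the available power may be sketched as follows. The function name, the per-node power table, and the values are assumptions made for this sketch.

```python
from typing import Dict, Optional


def select_uniform_frequency(
    per_node_power_w: Dict[int, float],
    num_nodes: int,
    available_power_w: float,
) -> Optional[int]:
    """Return the highest frequency (MHz) whose total job power fits the limit.

    per_node_power_w maps each operating frequency to the estimated power of
    one node at that frequency. Returns None if no operating point fits.
    """
    feasible = [
        freq for freq, power in per_node_power_w.items()
        if power * num_nodes <= available_power_w
    ]
    return max(feasible) if feasible else None


operating_points = {1200: 150.0, 1600: 190.0, 2000: 240.0}

# Four nodes at 2000 MHz would need 960 W, so 1600 MHz (760 W) is selected.
chosen = select_uniform_frequency(operating_points, num_nodes=4, available_power_w=800.0)

# With only 500 W available, even the lowest operating point does not fit.
not_launchable = select_uniform_frequency(operating_points, num_nodes=4, available_power_w=500.0)
```

A `None` result corresponds to the case in which the job cannot be launched now and would instead be scheduled for a future time.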

FIG. 8 illustrates the interaction of Job Manager 420 with Power Aware Job Launcher 412 according to Power Performance Policies 440. Once a job is launched, it is assigned a job manager, for example Job Manager 420. Job Manager 420 manages power performance of the job throughout its life cycle. In one embodiment, Job Manager 420 is responsible for operating the job within the constraints of one or more power policies and various power limits after the job has been launched. In one embodiment, for example, a user may designate “special” jobs that are not power limited. Power Aware Job scheduler 411 will need to estimate the maximum power the job could consume, and only start the job when the power is available. HPC System Power Manager 300 redistributes power among the normal jobs in order to reduce stranded power and increase efficiency. But even if the allocated power for HPC System 400 falls, the workload manager ensures that these “special” jobs' power allocations remain intact. In another example, a user may specify the frequency for a particular job. In one embodiment, user selection may be based upon a table that indicates degradation in performance and reduction in power for each frequency.

Alternatively, the frequency selection for the jobs can be automated based upon available power. In one embodiment, with dynamic power monitoring, Job Manager 420 will adjust the frequency periodically based upon power headroom. An advantage of embodiments described herein is that a job will be allowed to operate at all available frequencies. Job Manager 420 will determine the best mode to run the job based upon the policies and priorities communicated by Facility Administrator 102, Utility Provider 103, User 201, and HPC System Administrator 202.
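By way of a non-limiting illustration, a single periodic adjustment step based on power headroom may be sketched as follows. The one-step-up, one-step-down rule, the function name, and the values are assumptions made for this sketch; the embodiments do not prescribe a particular control rule.

```python
from typing import Dict, List


def adjust_frequency(
    freqs_mhz: List[int],
    power_by_freq_w: Dict[int, float],
    current_mhz: int,
    measured_w: float,
    budget_w: float,
) -> int:
    """Return the frequency for the next period.

    freqs_mhz must be sorted in ascending order. Steps down when the measured
    power exceeds the budget; steps up when the headroom covers the estimated
    extra power of the next higher frequency; otherwise holds.
    """
    i = freqs_mhz.index(current_mhz)
    if measured_w > budget_w:
        return freqs_mhz[i - 1] if i > 0 else current_mhz
    if i + 1 < len(freqs_mhz):
        extra_w = power_by_freq_w[freqs_mhz[i + 1]] - power_by_freq_w[current_mhz]
        if budget_w - measured_w >= extra_w:
            return freqs_mhz[i + 1]
    return current_mhz


freqs = [1200, 1600, 2000]
power = {1200: 150.0, 1600: 190.0, 2000: 240.0}

step_up = adjust_frequency(freqs, power, 1600, measured_w=185.0, budget_w=240.0)
step_down = adjust_frequency(freqs, power, 2000, measured_w=250.0, budget_w=240.0)
hold = adjust_frequency(freqs, power, 1600, measured_w=185.0, budget_w=200.0)
```

Calling such a function once per monitoring period would let a job move across all available frequencies as the headroom changes.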

FIG. 9 is a flow diagram of one embodiment of a process for managing power and performance of HPC systems. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), firmware or a combination of the three.

Referring to FIG. 9, at block 901, HPPM communicates capacity and requirements of the HPC system to a utility provider. In one embodiment, the capacity of the HPC system is determined based on the cooling and power delivery capacity of the HPC system. In one embodiment, the HPPM communicates its capacity and requirements to the utility provider through a demand/response interface. In one embodiment, the demand/response interface reduces a cost for the power budget based on the capacity and requirements of the HPC system and input from the utility provider. In one embodiment the demand/response interface communicates the capacity and requirements of the HPC system through an automated mechanism.

At block 902, HPPM determines a power budget for the HPC system. In one embodiment, the power budget is determined based on the cooling and power delivery capacity of the HPC system. In one embodiment, the power budget is determined based on the power performance policies. In one embodiment, the power performance policies are based on at least one of a facility policy, a utility provider policy, a facility administrative policy, and a user policy.

At block 903, HPPM determines a power and cooling capacity of the HPC system. In one embodiment, determining the power and cooling capacity of the HPC system includes monitoring and reporting failures of power delivery and cooling infrastructures. In one embodiment, in case of a failure the power consumption is adjusted accordingly. In one embodiment, determining the power and cooling capacity of the HPC system is performed by an out of band mechanism.

At block 904, HPPM allocates the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system. In one embodiment, allocating the power budget to the job is based on power performance policies. In one embodiment, allocating the power budget to the job is based on an estimate of power required to execute the job. In one embodiment, the estimate of the required power to execute the job is based on at least one of a monitored power, an estimated power, and a calibrated power.
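The allocation at block 904 can be sketched as follows. The function names are hypothetical, and the order of preference among monitored, calibrated, and estimated power is an assumption made for illustration; the description states only that the estimate is based on at least one of them.

```python
def estimate_job_power(monitored_w=None, calibrated_w=None, estimated_w=None):
    """Return the first available power figure.

    Assumption: monitored power is preferred over calibrated power,
    which is preferred over a plain estimate.
    """
    for value in (monitored_w, calibrated_w, estimated_w):
        if value is not None:
            return value
    raise ValueError("no power figure available for the job")


def allocate(job_estimate_w, already_allocated_w, power_budget_w, capacity_w):
    """Grant the estimate only if consumption stays within both limits."""
    limit_w = min(power_budget_w, capacity_w)
    headroom_w = limit_w - already_allocated_w
    return job_estimate_w if job_estimate_w <= headroom_w else None
```

Taking the minimum of the power budget and the power and cooling capacity keeps the system within whichever limit is tighter at the moment of allocation.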

At block 905, HPPM executes the job on selected HPC nodes. In one embodiment, the selected HPC nodes are selected based on power performance policies. In one embodiment, the selected HPC nodes are selected based on power characteristics of the nodes. In one embodiment, the power characteristics of the HPC nodes are determined based on running of a sample workload. In one embodiment, the power characteristics of the HPC nodes are determined during runtime. In one embodiment, the job is executed on the selected HPC nodes based on power performance policies.
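One way node selection by power characteristics might look is sketched below. The function select_nodes and the criterion of preferring the nodes that drew the least power on a sample workload are assumptions for illustration; an embodiment may rank nodes differently under its power performance policies.

```python
def select_nodes(node_power_w, count):
    """Choose the `count` nodes with the lowest measured power.

    node_power_w maps a node name to the power in watts that the node
    drew while running a sample workload (hypothetical calibration data).
    Ties are broken by node name so the result is deterministic.
    """
    if count > len(node_power_w):
        raise ValueError("not enough nodes")
    ranked = sorted(node_power_w, key=lambda name: (node_power_w[name], name))
    return ranked[:count]
```

The same function could be fed with figures gathered during runtime instead of calibration figures, since it depends only on a per-node power value.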

FIG. 10 is a flow diagram of one embodiment of a process for managing power and performance of HPC systems. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), firmware or a combination of the three. Referring to FIG. 10, at block 1001, HPPM defines a hard power limit based on a thermal and power delivery capacity of a HPC facility. In one embodiment, the hard power limit is managed and monitored by an out of band mechanism. In one embodiment, the hard power limit decreases in response to failures of the power and cooling infrastructures of the HPC facility.

At block 1002, HPPM defines a soft power limit based on a power budget allocated to the HPC facility. In one embodiment, the power budget for the HPC facility is provided by a utility provider through a demand/response interface. In one embodiment, the demand/response interface reduces a cost for the power budget based on the capacity and requirements of the HPC system and input from the utility provider. In one embodiment the demand/response interface communicates the capacity and requirements of the HPC system through an automated mechanism.

At block 1003, HPPM allocates the power budget to the job to maintain an average power consumption of the HPC facility below the soft power limit. In one embodiment, allocating the power budget to the job is based on power performance policies. In one embodiment, allocating the power budget to the job is based on an estimate of power required to execute the job. In one embodiment, the estimate of the required power to execute the job is based on at least one of a monitored power, an estimated power, and a calibrated power.

At block 1004, HPPM executes the job on nodes while maintaining the soft power limit at or below the hard power limit. In one embodiment, allocating the power budget to the job and executing the job on the nodes is according to power performance policies. In one embodiment, the power performance policies are based on at least one of a HPC facility policy, a utility provider policy, a HPC administrative policy, and a user policy.
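The relationship between the hard and soft power limits of FIG. 10 can be sketched as follows. The function names and the linear derating by a failed fraction of the infrastructure are assumptions for illustration only; the description states that the hard limit decreases on failures but does not say by how much.

```python
def hard_limit(power_delivery_w, cooling_w, failed_fraction=0.0):
    """Hard limit from thermal and power delivery capacity.

    Assumption: a failure removes a proportional share of capacity.
    """
    return min(power_delivery_w, cooling_w) * (1.0 - failed_fraction)


def soft_limit(utility_budget_w, hard_w):
    """Soft limit from the power budget, kept at or below the hard limit."""
    return min(utility_budget_w, hard_w)


def average_below(samples_w, soft_w):
    """Check that the average of the power samples is below the soft limit."""
    return sum(samples_w) / len(samples_w) < soft_w
```

Because the soft limit constrains average consumption, an individual sample may exceed it for a short time as long as the average remains below it.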

Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of transactions on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of transactions leading to a desired result. The transactions are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method transactions. The required structure for a variety of these systems will appear from the description above. In addition, embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the invention as described herein.

In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

Throughout the description, embodiments of the present invention have been presented through flow diagrams. It will be appreciated that the transactions described in these flow diagrams, and the order of those transactions, are intended only for illustrative purposes and not as a limitation of the present invention. One having ordinary skill in the art would recognize that variations can be made to the flow diagrams without departing from the broader spirit and scope of the invention as set forth in the following claims.

The following examples pertain to further embodiments:

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the selected HPC nodes are selected based on power performance policies.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the selected HPC nodes are selected based on power characteristics of the nodes.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the selected HPC nodes are selected based on power characteristics of the nodes determined based on running of a sample workload.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the selected HPC nodes are selected based on power characteristics of the nodes determined during runtime.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the job is executed on the selected HPC nodes based on power performance policies.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein allocating the power budget to the job is based on power performance policies.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the allocating the power budget to the job is based on an estimate of power required to execute the job.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the allocating the power budget to the job is based on an estimate of power required to execute the job determined based on at least one of a monitored power, an estimated power, and a calibrated power.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein determining the power budget for the HPC system is based on the power and cooling capacity of the HPC system.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein determining the power budget for the HPC system is performed by communicating to a utility provider through a demand/response interface. In one embodiment, the demand/response interface reduces a cost for the power budget based on the capacity and requirements of the HPC system and inputs from the utility provider. In one embodiment, the demand/response interface communicates the capacity and requirements of the HPC system through an automated mechanism.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein determining the power and cooling capacity of the HPC system includes monitoring and reporting failures of power delivery and cooling infrastructures. In one embodiment, the method further comprises adjusting the power consumption of the HPC system in response to the failure of the power and cooling infrastructures.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein determining a power and cooling capacity of the HPC system is performed by an out of band mechanism.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, defining a hard power limit based on a thermal and power delivery capacity of a HPC facility, wherein the HPC facility includes a plurality of HPC systems, and the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, defining a soft power limit based on a power budget allocated to the HPC facility, allocating the power budget to the job to maintain an average power consumption of the HPC facility below the soft power limit, executing the job on nodes while maintaining the soft power limit at or below the hard power limit, and allocating the power budget to the job and executing the job on the nodes according to power performance policies.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, defining a hard power limit based on a thermal and power delivery capacity of a HPC facility, wherein the HPC facility includes a plurality of HPC systems, and the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, defining a soft power limit based on a power budget allocated to the HPC facility, allocating the power budget to the job to maintain an average power consumption of the HPC facility below the soft power limit, executing the job on nodes while maintaining the soft power limit at or below the hard power limit, and allocating the power budget to the job and executing the job on the nodes according to power performance policies, wherein the hard power limit decreases in response to failures of the power and cooling infrastructures of the HPC facility.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, defining a hard power limit based on a thermal and power delivery capacity of a HPC facility, wherein the HPC facility includes a plurality of HPC systems, and the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, defining a soft power limit based on a power budget allocated to the HPC facility, allocating the power budget to the job to maintain an average power consumption of the HPC facility below the soft power limit, executing the job on nodes while maintaining the soft power limit at or below the hard power limit, and allocating the power budget to the job and executing the job on the nodes according to power performance policies, wherein allocating the power budget to the job is based on an estimate of a required power to execute the job.

A method of managing power and performance of a High-performance computing (HPC) system, comprising, defining a hard power limit based on a thermal and power delivery capacity of a HPC facility, wherein the HPC facility includes a plurality of HPC systems, and the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, defining a soft power limit based on a power budget allocated to the HPC facility, allocating the power budget to the job to maintain an average power consumption of the HPC facility below the soft power limit, executing the job on nodes while maintaining the soft power limit at or below the hard power limit, and allocating the power budget to the job and executing the job on the nodes according to power performance policies, wherein the hard power limit is managed by an out of band mechanism. In one embodiment, the power performance policies are based on at least one of a HPC facility policy, a utility provider policy, a HPC administrative policy, and a user policy.

A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes.

A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the selected HPC nodes to execute the job are selected based in part on power performance policies.

A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the selected HPC nodes to execute the job are selected based in part on power characteristics of the nodes. In one embodiment, the power characteristics of the HPC nodes are determined upon running a sample workload. In another embodiment, the power characteristics of the HPC nodes are determined during an actual runtime.

A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the job is executed on the selected HPC nodes based in part upon power performance policies.

A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein allocating the power budget to the job is based in part upon power performance policies.

A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein the allocating the power budget to the job is based in part on an estimate of a required power to execute the job. In one embodiment, the estimate of the required power to execute the job is in part based upon at least one of a monitored power, an estimated power, and a calibrated power.

A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein determining the power budget for the HPC system is in part based upon the power and cooling capacity of the HPC system.

A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein determining the power budget for the HPC system is performed in part by communicating to a utility provider through a demand/response interface. In one embodiment, the demand/response interface reduces a cost for the power budget based on the capacity and requirements of the HPC system and inputs from the utility provider. In one embodiment, the demand/response interface communicates the capacity and requirements of the HPC system through an automated mechanism.

A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein determining the power and cooling capacity of the HPC system includes monitoring and reporting failures of power delivery and cooling infrastructures. In one embodiment, the method further comprises adjusting the power consumption of the HPC system in response to the failure of the power and cooling infrastructures.

A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising, determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, determining a power and cooling capacity of the HPC system, allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system, and executing the job on selected HPC nodes, wherein determining a power and cooling capacity of the HPC system is performed by an out of band system.

A system for managing power and performance of a High-performance computing (HPC) system, comprising, a HPC facility manager to determine a power budget for the HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, an out of band mechanism to monitor and report a cooling and power capacity of the HPC system to a HPC system manager, the HPC system manager to allocate the power budget to the job within limitations of the cooling and power capacity of the HPC system, and a job manager to execute the job on selected nodes.

A system for managing power and performance of a High-performance computing (HPC) system, comprising, a HPC facility manager to determine a power budget for the HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, an out of band mechanism to monitor and report a cooling and power capacity of the HPC system to a HPC system manager, the HPC system manager to allocate the power budget to the job within limitations of the cooling and power capacity of the HPC system, and a job manager to execute the job on selected nodes, wherein the HPC facility manager, the HPC system manager, and the job manager are governed by power performance policies. In one embodiment, the power performance policies are in part based upon at least one of a HPC facility policy, a utility provider policy, a HPC administrative policy, and a user policy.

A system for managing power and performance of a High-performance computing (HPC) system, comprising, a HPC facility manager to determine a power budget for the HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, an out of band mechanism to monitor and report a cooling and power capacity of the HPC system to a HPC system manager, the HPC system manager to allocate the power budget to the job within limitations of the cooling and power capacity of the HPC system, and a job manager to execute the job on selected nodes, wherein the HPC system manager selects the selected HPC nodes to execute the job based in part on power characteristics of the nodes. In one embodiment, a calibrator runs a sample workload on the HPC nodes and reports the power characteristics of the HPC nodes to the HPC system manager. In another embodiment, a calibrator determines the power characteristics of the HPC nodes during an actual runtime and reports them to the HPC system manager.

A system for managing power and performance of a High-performance computing (HPC) system, comprising, a HPC facility manager to determine a power budget for the HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, an out of band mechanism to monitor and report a cooling and power capacity of the HPC system to a HPC system manager, the HPC system manager to allocate the power budget to the job within limitations of the cooling and power capacity of the HPC system, and a job manager to execute the job on selected nodes, wherein the HPC system manager allocates power to the job based in part on an estimated power required to run the job. In one embodiment, an estimator calculates the estimated power required to run the job in part based upon at least one of a monitored power, an estimated power, and a calibrated power.

A system for managing power and performance of a High-performance computing (HPC) system, comprising: a HPC facility manager to determine a power budget for the HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, an out of band mechanism to monitor and report a cooling and power capacity of the HPC system to a HPC system manager, the HPC system manager to allocate the power budget to the job within limitations of the cooling and power capacity of the HPC system, and a job manager to execute the job on selected nodes, wherein the out of band mechanism monitors and reports failures of power delivery and cooling infrastructures of a HPC facility to the HPC facility manager.

A system for managing power and performance of a High-performance computing (HPC) system, comprising: a HPC facility manager to determine a power budget for the HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, an out of band mechanism to monitor and report a cooling and power capacity of the HPC system to a HPC system manager, the HPC system manager to allocate the power budget to the job within limitations of the cooling and power capacity of the HPC system, and a job manager to execute the job on selected nodes, wherein the out of band mechanism monitors and reports failures of power delivery and cooling infrastructures of the HPC system to the HPC system manager.
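One way the HPC system manager might act on such failure reports is sketched below, purely as an illustration. The sketch assumes that each failure report carries the capacity lost in watts and that job allocations are scaled down proportionally when they exceed the reduced capacity; the function names and the proportional policy are hypothetical.

```python
from typing import Dict


def effective_capacity(nominal_w: float, failures: Dict[str, float]) -> float:
    """Subtract the capacity lost to reported failures from the nominal capacity.

    `failures` maps a failed component id to the capacity lost in watts
    (an assumed report format).
    """
    lost_w = sum(failures.values())
    return max(0.0, nominal_w - lost_w)


def rescale_allocations(allocations: Dict[str, float],
                        limit_w: float) -> Dict[str, float]:
    """Scale per-job power allocations so that their total fits within limit_w."""
    total_w = sum(allocations.values())
    if total_w <= limit_w or total_w == 0:
        return dict(allocations)
    # Assumed policy: every job gives up the same fraction of its allocation.
    scale = limit_w / total_w
    return {job: watts * scale for job, watts in allocations.items()}
```

A policy that suspends low-priority jobs instead of scaling every job would be equally consistent with the description; proportional scaling is used here only to keep the sketch short.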

A system for managing power and performance of a High-performance computing (HPC) system, comprising: a HPC facility manager to determine a power budget for the HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job, an out of band mechanism to monitor and report a cooling and power capacity of the HPC system to a HPC system manager, the HPC system manager to allocate the power budget to the job within limitations of the cooling and power capacity of the HPC system, and a job manager to execute the job on selected nodes, wherein the HPC facility manager communicates capacity and requirements of the HPC system to a utility provider through a demand/response interface. In one embodiment, the demand/response interface reduces a cost for the power budget based on the capacity and requirements of the HPC system and inputs from the utility provider. In one embodiment, the demand/response interface communicates the capacity and requirements of the HPC system through an automated mechanism.

In the foregoing specification, methods and apparatuses have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of embodiments as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense. 

What is claimed is:
 1. A method of managing power and performance of a High-performance computing (HPC) system, comprising: determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job; determining a power and cooling capacity of the HPC system; allocating the power budget to the job to maintain a power consumption of the HPC system within the power budget and the power and cooling capacity of the HPC system; and executing the job on selected HPC nodes.
 2. The method of claim 1, wherein the selected HPC nodes are selected based on power characteristics of the nodes.
 3. The method of claim 2, wherein the power characteristics of the HPC nodes are determined based on running of sample workloads.
 4. The method of claim 1, wherein the allocating the power budget to the job is based on an estimate of power required to execute the job.
 5. The method of claim 4, wherein the estimate of the required power to execute the job is based on at least one of a monitored power, an estimated power, and a calibrated power.
 6. The method of claim 1, wherein determining the power budget for the HPC system is performed by communicating to a utility provider through a demand/response interface.
 7. The method of claim 1, wherein determining the power and cooling capacity of the HPC system includes monitoring and reporting failures of power delivery and cooling infrastructures.
 8. The method of claim 7, further comprising adjusting the power consumption of the HPC system in response to the failures of the power delivery and cooling infrastructures.
 9. The method of claim 1, wherein the allocating the power budget to the job and executing the job on selected HPC nodes are governed by power performance policies.
 10. A method of managing power and performance of a High-performance computing (HPC) system, comprising: defining a hard power limit based on a thermal and power delivery capacity of a HPC facility, wherein the HPC facility includes a plurality of HPC systems, and the HPC system includes a plurality of interconnected HPC nodes operable to execute a job; defining a soft power limit based on a power budget allocated to the HPC facility; allocating the power budget to the job to maintain an average power consumption of the HPC facility below the soft power limit; executing the job on nodes while maintaining the soft power limit at or below the hard power limit; and allocating the power budget to the job and executing the job on the nodes according to power performance policies.
 11. The method of claim 10, wherein the hard power limit decreases in response to failures of the power and cooling infrastructures of the HPC facility.
 12. The method of claim 10, wherein allocating the power budget to the job is based on an estimate of a required power to execute the job.
 13. The method of claim 10, wherein the power performance policies are based on at least one of a HPC facility policy, a utility provider policy, a HPC administrative policy, and a user policy.
 14. A computer readable medium having stored thereon sequences of instructions which are executable by a system, and which, when executed by the system, cause the system to perform a method, comprising: determining a power budget for a HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job; determining a power and cooling capacity of the HPC system; allocating the power budget to the job such that a power consumption of the HPC system stays within the power budget and the power and cooling capacity of the HPC system; and executing the job on selected HPC nodes.
 15. The computer readable medium of claim 14, wherein the selected HPC nodes to execute the job are selected based in part on power characteristics of the nodes.
 16. The computer readable medium of claim 15, wherein the power characteristics of the HPC nodes are determined upon running a sample workload.
 17. The computer readable medium of claim 14, wherein the allocating the power budget to the job is based in part on an estimate of a required power to execute the job.
 18. The computer readable medium of claim 17, wherein the estimate of the required power to execute the job is in part based upon at least one of a monitored power, an estimated power, and a calibrated power.
 19. The computer readable medium of claim 14, wherein determining the power budget for the HPC system is performed in part by communicating to a utility provider through a demand/response interface.
 20. The computer readable medium of claim 14, wherein the allocating the power budget to the job and executing the job on selected HPC nodes are governed by power performance policies.
 21. A system for managing power and performance of a High-performance computing (HPC) system, comprising: a HPC Facility Power Manager to determine a power budget for the HPC system, wherein the HPC system includes a plurality of interconnected HPC nodes operable to execute a job; an out of band mechanism to monitor and report a cooling and power capacity of the HPC system to a HPC System Power Manager; the HPC System Power Manager to allocate the power budget to the job within limitations of the cooling and power capacity of the HPC system; and a job manager to execute the job on selected nodes.
 22. The system of claim 21, wherein the HPC System Power Manager selects the selected HPC nodes to execute the job based in part on power characteristics of the nodes.
 23. The system of claim 21, wherein the HPC System Power Manager allocates power to the job based in part on an estimated power required to run the job.
 24. The system of claim 21, wherein the out of band mechanism monitors and reports failures of power delivery and cooling infrastructures of the HPC system to the HPC System Power Manager.
 25. The system of claim 21, wherein the HPC Facility Power Manager communicates capacity and requirements of the HPC system to a utility provider through a demand/response interface. 